[REVIEW] Add performance benchmarks to user facing docs #12595
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## branch-23.04 #12595 +/- ##
===============================================
Coverage ? 85.73%
===============================================
Files ? 155
Lines ? 24889
Branches ? 0
===============================================
Hits ? 21339
Misses ? 3550
Partials ? 0
Thank you Prem for starting this work. This is very important!
" \"numbers\": np.random.randint(-1000, 1000, 10_000_000, dtype='int64'),\n", | ||
" \"business\": np.random.choice([\"McD\", \"Buckees\", \"Walmart\", \"Costco\"], size=10_000_000)\n", | ||
"})" |
Why not use `cupy` here for the `gdf`?
Primarily because `cupy` doesn't support `str` types yet.
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"pandas_read_parquet = time_it(lambda : pd.read_parquet(\"pandas.parquet\"))" |
For the reader functions, we need to consider cache clearing. I always use this command when benchmarking file systems: os.system("/sbin/sysctl vm.drop_caches=3")
Also, we don't know where the file is actually going for the user that runs this! It could be a local drive, it could be a network drive, it could be a virtual drive that is actually a network drive. Generally faster drives allow our readers to show faster speedups. We may want to compare host buffers and files, and provide some analysis based on the comparison.
All that said, for the purposes of this notebook I think your approach is the best one. We could include more documentation about how IO works. We could also add a drive read speed test, and report performance relative to that. I'm looking forward to discussing this more with you.
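The drive read speed test mentioned above could be sketched roughly as follows. This is only an illustration: `raw_read_throughput` and `relative_to_raw` are hypothetical helper names, the temporary file stands in for `pandas.parquet`, and the page cache is not dropped here because `vm.drop_caches` needs root privileges.

```python
import os
import tempfile
import time


def raw_read_throughput(path, chunk_size=1 << 20):
    """Return (bytes_read, seconds) for one sequential pass over `path`."""
    total = 0
    start = time.perf_counter()
    # buffering=0 avoids Python-level buffering; the OS page cache still applies.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total, elapsed


def relative_to_raw(reader_seconds, raw_seconds):
    """Express a reader's wall time as a multiple of the raw read time."""
    return reader_seconds / raw_seconds


# Demo on a small temporary file (an assumption, not the notebook's data).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(4 << 20))  # 4 MiB of random bytes
    demo_path = tmp.name

nbytes, seconds = raw_read_throughput(demo_path)
os.remove(demo_path)
print(f"read {nbytes} bytes in {seconds:.6f} s")
```

Reporting `relative_to_raw(reader_time, seconds)` next to each reader benchmark would make the numbers less dependent on whether the file sits on a local or a network drive.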
@@ -0,0 +1,1699 @@
{
Line #1. pandas_upper = timeit.timeit(lambda : pd_series.str.upper(), number=20)
Can we write all these benchmarks in a more reusable style? Maybe track the results in a dictionary?
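def bench(pdf, gdf, func, **kwargs):
    # timeit.timeit expects a callable (or a statement string), so wrap each call in a lambda
    pdf_time = timeit.timeit(lambda: func(pdf), **kwargs)
    gdf_time = timeit.timeit(lambda: func(gdf), **kwargs)
    return pdf_time, gdf_time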
upper = bench(pd_series, gd_series, lambda df: df.str.upper(), number=20)
contains = bench(pd_series, gd_series, lambda df: df.str.contains(r"[0-9][a-z]"), number=20)
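One way the dictionary tracking could look, as a minimal sketch. The `results` dictionary layout and the extra `name` argument are assumptions, and plain Python lists stand in for the pandas and cuDF series so the snippet runs without a GPU.

```python
import timeit


def bench(results, name, cpu_obj, gpu_obj, func, number=20):
    """Time `func` on both objects and record the timings under `name`."""
    cpu_time = timeit.timeit(lambda: func(cpu_obj), number=number)
    gpu_time = timeit.timeit(lambda: func(gpu_obj), number=number)
    results[name] = {
        "cpu": cpu_time,
        "gpu": gpu_time,
        "speedup": cpu_time / gpu_time,
    }
    return results[name]


# Stand-ins for pd_series and gd_series (assumption: any two objects work).
cpu_data = ["abc1x", "hello", "9z"] * 1000
gpu_data = list(cpu_data)

results = {}
bench(results, "upper", cpu_data, gpu_data, lambda xs: [x.upper() for x in xs])
bench(
    results,
    "contains_digit",
    cpu_data,
    gpu_data,
    lambda xs: [any(c.isdigit() for c in x) for x in xs],
)

for name, row in results.items():
    print(f"{name}: cpu={row['cpu']:.4f}s gpu={row['gpu']:.4f}s")
```

Keeping every result in one dictionary means the plotting cells can iterate over it instead of referring to one variable per benchmark.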
Done 👍
@galipremsagar looking to run the notebook this week - think that's possible?
Let's add a symlink in the |
Done 👍 |
@@ -0,0 +1,1568 @@
{
Line #2. gdf
Combine with the above cell. No need to show both pdf and gdf outputs. Readers will trust that's correct.
@@ -0,0 +1,1568 @@
{
Line #9. gd_obj : cuDF object
Let's name this `cudf_obj`. Match the module name (pd, cudf).
Ah, I see Bradley has the same idea.
@@ -0,0 +1,1568 @@
{
Line #1. pdf = pdf.head(100_000_000)
Why did we create 300M rows, only to trim it to 100M? It's a bit confusing to define `num_rows` and then not benchmark data with that number of rows. Let's use 100M or 300M everywhere.
I see below that we redefine `num_rows = 1_000_000` with a new dataset. That's okay - but within the benchmarks of a given dataset, we should be consistent so that it doesn't look like we're cherry-picking data sizes to show the best speedup (especially for smaller datasets like 1M or 100M instead of 300M).
@@ -0,0 +1,1568 @@
{
Line #5. _ = gc.collect()
If you're trying to avoid showing the result, just write `gc.collect();` with a semicolon to prevent displaying the output. Assigning the result looks odd.
@@ -0,0 +1,1568 @@
{
Line #1. gd_series = cudf.from_pandas(pd_series)
Combine so all this is in one cell:
num_rows = 300_000_000
pd_series = ...
gd_series = ...
@@ -0,0 +1,1568 @@
{
Line #1. gdf_age = cudf.from_pandas(pdf_age)
Combine this cell with the cell above.
@@ -0,0 +1,1568 @@
{
Line #3. pdf['key'] = np.random.randint(0,2,size)
Add spaces after the commas.
Separate issue: We should enable the `nbqa` pre-commit hook with black/isort/etc. for our repo... example: https://github.com/glotzerlab/signac-examples/blob/6c0f1efdaf87d361f29d8e035025a87fccf4f57d/.pre-commit-config.yaml#L40-L52
@@ -0,0 +1,1568 @@
{
Line #1. !lscpu
FYI: This benchmarks a CPU released in 2019 against the H100 GPU made available in 2022-2023. I saw that the website benchmarks Allan collected used an AMD EPYC 7642, which is also a 2019-2020 era CPU, with an A100 (released in May 2020). That's a more fair comparison.
Particularly since a "good" implementation should be bandwidth-bound, and the max single-socket spec-sheet bandwidth on one of these chips is 140GB/s (ballpark). Probably single-core stream will top out at 20GB/s.
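To put those bandwidth figures in perspective, here is a back-of-the-envelope sketch of the minimum time a single read pass would take. It assumes the 300M-row size and `int64` dtype mentioned elsewhere in this thread, counts only one read of the data (no writes), and treats 1 GB as 1e9 bytes; `min_seconds` is a hypothetical helper.

```python
def min_seconds(num_rows, bytes_per_value, bandwidth_gb_per_s):
    """Lower bound on the time to stream the data once at a given bandwidth."""
    return num_rows * bytes_per_value / (bandwidth_gb_per_s * 1e9)


NUM_ROWS = 300_000_000  # assumption: the largest dataset size discussed above
BYTES_PER_VALUE = 8     # int64

single_core = min_seconds(NUM_ROWS, BYTES_PER_VALUE, 20)    # ballpark single-core stream
full_socket = min_seconds(NUM_ROWS, BYTES_PER_VALUE, 140)   # ballpark spec-sheet bandwidth

print(f"single core: {single_core:.3f} s, full socket: {full_socket:.3f} s")
```

Any pandas timing far above the single-core bound suggests the operation is not bandwidth-bound on the CPU, which is worth keeping in mind when reading the speedup numbers.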
This enables `black` and `isort` linters for ipynb notebooks via [nbqa](https://github.com/nbQA-dev/nbQA). I propose this change to avoid manually linting notebooks like #12595. cc: @galipremsagar
Authors:
- Bradley Dice (https://github.com/bdice)
Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
URL: #12848
@bdice and @galipremsagar think we could wrap this up so the site can launch? |
Approving for now, so that the notebook can be linked from the new rapids.ai site. We will address further review comments in a follow-up PR. (Discussed offline with @galipremsagar and @exactlyallan)
@@ -0,0 +1,1647 @@
{
In the different sections can we have a little bit of setup? Possibly with some linking to user-guide docs as well?
@@ -0,0 +1,1647 @@
{
Line #2. gdf
`gdf` is a somewhat impenetrable name (if the pandas dataframe is called `pdf` why is the cudf dataframe not called `cdf`?). If this is didactic, perhaps use `pandas_df` and `cudf_df` throughout?
@@ -0,0 +1,1647 @@
{
Line #9. plt.show()
Is it possible to also show absolute time as a separate subplot (or a right second y-axis)? If pandas is only half a second the speedup is still impressive, but is less motivating than if one goes from 30 seconds to a fraction of a second.
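A rough sketch of the separate-subplot variant, assuming matplotlib is available as it is in the notebook. The function names are hypothetical, and the timing values in the demo are placeholders chosen to mirror the "half a second" and "30 seconds" cases above, not measurements.

```python
def speedups(pandas_times, cudf_times):
    """Element-wise pandas time divided by cuDF time."""
    return [p / c for p, c in zip(pandas_times, cudf_times)]


def plot_speedup_and_times(labels, pandas_times, cudf_times):
    """Draw speedup on the left and absolute wall time on the right."""
    import matplotlib.pyplot as plt  # imported lazily so the helper above has no dependency

    fig, (ax_speedup, ax_time) = plt.subplots(1, 2, figsize=(10, 4))

    ax_speedup.bar(labels, speedups(pandas_times, cudf_times))
    ax_speedup.set_ylabel("Speedup (pandas time / cuDF time)")

    positions = list(range(len(labels)))
    width = 0.4
    ax_time.bar([i - width / 2 for i in positions], pandas_times, width, label="pandas")
    ax_time.bar([i + width / 2 for i in positions], cudf_times, width, label="cuDF")
    ax_time.set_xticks(positions)
    ax_time.set_xticklabels(labels)
    ax_time.set_yscale("log")  # absolute times can differ by orders of magnitude
    ax_time.set_ylabel("Wall time (s)")
    ax_time.legend()

    fig.tight_layout()
    return fig


# Placeholder values only (assumption), to show the shape of the inputs.
labels = ["short op", "long op"]
pandas_times = [0.5, 30.0]
cudf_times = [0.125, 0.25]
print(speedups(pandas_times, cudf_times))
```

Calling `plot_speedup_and_times(labels, pandas_times, cudf_times)` followed by `plt.show()` in the notebook would render both panels side by side.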
@@ -0,0 +1,1647 @@
{
Line #6. _ = gc.collect()
Although I think I know what you're doing, this is kind of an odd thing to have to do at all. It's not the kind of thing you would necessarily write as part of a normal analysis I think.
Does it make sense to instead define the benchmarks inside functions that set up the data, run, and collate results? Then the dataframes don't escape the scope of the function call and will get collected automatically.
In particular, the requirement to do `gc.collect()` (presumably to clean out cycles) is an anti-pattern that readers of this notebook may well cargo-cult.
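A minimal sketch of the function-scoped approach, under the assumption that the benchmark data holds no reference cycles. `Payload` is a hypothetical stand-in for a dataframe (a list subclass so a weak reference can observe it), and the immediate release shown here relies on CPython's reference counting.

```python
import time
import weakref


class Payload(list):
    """List subclass so the demo can hold a weak reference to the data."""


def run_benchmark(make_data, func, number=5):
    """Create the data, time `func` on it, and return only the results."""
    data = make_data()
    probe = weakref.ref(data)  # lets the caller check whether `data` was freed
    start = time.perf_counter()
    for _ in range(number):
        func(data)
    elapsed = time.perf_counter() - start
    # `data` goes out of scope when the function returns.
    return elapsed / number, probe


mean_seconds, probe = run_benchmark(lambda: Payload(range(100_000)), sorted)
print(f"mean: {mean_seconds:.6f} s, data released: {probe() is None}")
```

If the real objects do contain cycles, this pattern alone would not free them promptly, so it would be worth checking before dropping the explicit collection.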
@@ -0,0 +1,1647 @@
{
Line #5. )
It would be nice if we were able to:
- use strings that have less baggage associated with them
- pick some distribution of lengths that is perhaps realistic. One could either have a normal distribution, or else perhaps a heavy-tailed log-normal. Here's some recent discussion on statistical models of sentence length: https://aclanthology.org/W19-5710.pdf
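As an illustration of the log-normal option, the following sketch draws string lengths from a log-normal distribution and fills them with random lowercase letters. The `mu` and `sigma` values are arbitrary assumptions rather than parameters fitted to any corpus, and `random_strings` is a hypothetical helper.

```python
import random
import string


def random_strings(n, mu=3.0, sigma=0.6, seed=0):
    """Generate `n` neutral strings whose lengths follow a log-normal distribution."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        # The median of a log-normal is exp(mu), roughly 20 characters for mu=3.0.
        length = max(1, round(rng.lognormvariate(mu, sigma)))
        out.append("".join(rng.choices(string.ascii_lowercase, k=length)))
    return out


sample = random_strings(1000)
lengths = sorted(len(s) for s in sample)
print("median length:", lengths[len(lengths) // 2], "max length:", lengths[-1])
```

The resulting list could be passed to `pd.Series` and `cudf.Series` in place of the hard-coded strings.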
@@ -0,0 +1,1647 @@
{
Line #1. num_rows = 10_000_000
The size of the data keeps on changing. Maybe there is a good reason for this, but if there is, I think it should be spelled out. If not, then it looks odd.
@@ -0,0 +1,1647 @@
{
Line #5. return 1
An idiomatic way of writing this function would be: `return int(row.isupper())`. Does that _work_?
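For plain Python strings the two spellings agree, as the sketch below shows. The branching version is an assumed reconstruction of the function under review, and this says nothing about whether the shorter form compiles as a cuDF UDF, which is the open question here.

```python
def is_upper_branching(row):
    # Assumed shape of the original function: explicit if/else returning 1 or 0.
    if row.isupper():
        return 1
    else:
        return 0


def is_upper_idiomatic(row):
    # bool is converted to 1 or 0 by int().
    return int(row.isupper())


samples = ["ABC", "abc", "AbC", "A1", "123", ""]
matches = [is_upper_branching(s) == is_upper_idiomatic(s) for s in samples]
print(all(matches))
```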
@@ -0,0 +1,1647 @@
{
As mentioned, if we collate the individual runs then we can use a box-and-whiskers or violin plot to show all of this data in one, I think.
There's also a question of what the fairest comparison is. I think, if possible, we should also show the "first run" cost when loading the jitted code from the disk cache (I think that's a thing?). You might run the same workflow many times from scratch, and if the cache persists, that's the number you care about.
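Collecting the individual runs, with the first call timed separately, could look roughly like this. `fake_jit_kernel` is a made-up stand-in whose first call sleeps to imitate a one-time compile or cache-load cost; it is not cuDF's JIT, and `collect_runs` is a hypothetical helper.

```python
import statistics
import time
import timeit

_compiled = {}


def fake_jit_kernel(x):
    """Stand-in for a jitted UDF: the first call pays a one-time setup cost."""
    if "kernel" not in _compiled:
        time.sleep(0.05)  # pretend compilation or disk-cache load
        _compiled["kernel"] = lambda v: v * 2
    return _compiled["kernel"](x)


def collect_runs(func, repeat=10):
    """Return the first-run time and a list of per-run warm timings."""
    start = time.perf_counter()
    func()
    first_run = time.perf_counter() - start
    # number=1 keeps one sample per run, which is what a box or violin plot needs.
    warm_runs = timeit.repeat(func, repeat=repeat, number=1)
    return first_run, warm_runs


first_run, warm_runs = collect_runs(lambda: fake_jit_kernel(21))
print(f"first run: {first_run:.4f} s, warm median: {statistics.median(warm_runs):.6f} s")
```

The `warm_runs` list can be handed directly to `plt.boxplot` or `plt.violinplot`, with `first_run` shown as a separate marker.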
@@ -0,0 +1,1647 @@
{
Line #5. plt.show()
Does this plot add anything that the (single-entry) table does not?
/merge |
Description
Resolves: #12295
This PR introduces a notebook of benchmarks that users will be able to run if they download the notebook. The notebook also generates graphs which are going to show up in cudf python docs.
Checklist